feat(benchmark): add benchmark command with pipeline metrics and PR analysis#14
Merged
Add a new `swe benchmark` CLI subcommand that runs N candidate PRs through the full mining pipeline and outputs detailed metrics as JSON. The benchmark exercises the complete flow: GH Archive ingestion → enrichment → filtering → LLM classification → patch extraction → Docker-based agentic test generation → quality scoring → export.

Code changes in src/cli/commands.rs:

- Added `SweBenchmarkArgs` struct with configurable parameters (count, min-stars, languages, model, api-key, cache-db, output directory)
- Added a `Benchmark` variant to the `SweSubcommand` enum
- Implemented the `run_swe_benchmark_command` async handler that validates API keys, configures `SweOrchestrator`, runs the pipeline, and outputs JSON results

README.md updated with comprehensive English benchmark results from a run processing 100 PRs (2026-02-17), including:

- Pipeline funnel (1.75M raw events → 8 accepted tasks, 0.00046% yield)
- Difficulty distribution (81.8% medium, 18.2% easy, 0% hard)
- Quality metrics (avg 0.47, pass rate 72.7%, threshold ≥0.30)
- Throughput/timing (21 PRs extracted/hr, 8 accepted/hr, 171.4s avg per PR)
- Language distribution (Go 37.5%, Java 25%, Python 25%, TypeScript 12.5%)
- Accepted task listing with scores
- Test generation failure analysis
- Usage instructions for running the benchmark

Benchmark artifacts added:

- benchmark-output/ with 8 accepted task directories, each containing workspace.yaml, checks.txt, prompt.md, original_pr.md, and test scripts
- benchmark_output.json and benchmark_results.json with raw pipeline output
- benchmark_clean.log with the pipeline execution log
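The argument surface and subcommand wiring described above can be sketched in plain Rust. This is a hypothetical, dependency-free sketch: the real code in src/cli/commands.rs presumably derives these types with a CLI crate such as clap, and every field type and default below (other than the listed parameter names and the default count of 100 stated in the summary) is an assumption.

```rust
// Hypothetical sketch of the benchmark CLI types described above.
// Field types and defaults are assumptions; only the parameter names
// and the default batch size of 100 come from the PR description.

#[allow(dead_code)]
#[derive(Debug, Clone)]
struct SweBenchmarkArgs {
    count: usize,            // number of candidate PRs to run (default 100)
    min_stars: u32,          // minimum repo-stars filter
    languages: Vec<String>,  // language allow-list
    model: String,           // LLM model used for classification
    api_key: Option<String>, // validated before the pipeline starts
    cache_db: String,        // path to the cache database
    output_dir: String,      // where JSON results and task dirs are written
}

impl Default for SweBenchmarkArgs {
    fn default() -> Self {
        Self {
            count: 100, // the PR summary states 100 as the default
            min_stars: 0,
            languages: Vec::new(),
            model: String::new(),
            api_key: None,
            cache_db: String::from("cache.db"),
            output_dir: String::from("benchmark-output"),
        }
    }
}

#[allow(dead_code)]
#[derive(Debug)]
enum SweSubcommand {
    Benchmark(SweBenchmarkArgs),
    // ...other existing subcommands elided
}

fn main() {
    let cmd = SweSubcommand::Benchmark(SweBenchmarkArgs::default());
    if let SweSubcommand::Benchmark(a) = cmd {
        println!("count={} output_dir={}", a.count, a.output_dir);
    }
}
```

In the real handler, these parsed arguments would be handed to `SweOrchestrator` before the pipeline run; the exact constructor is not shown in this PR text.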
echobt added a commit that referenced this pull request on Apr 8, 2026
…nalysis (#14)

* feat(swe): add BenchmarkMetrics tracking to pipeline and orchestrator
* feat(benchmark): add benchmark command and pipeline metrics documentation
* ci: trigger CI run
Summary

Add a new `benchmark` CLI command that evaluates the SWE pipeline against a batch of PRs, collecting detailed metrics on filtering, difficulty distribution, quality, and throughput. Results are persisted as JSON and documented in the README.

Changes

- `benchmark` subcommand (src/cli/commands.rs): runs the pipeline on a configurable number of PRs (default 100) and outputs structured results
- `BenchmarkMetrics` tracking (src/swe/pipeline.rs, src/swe/orchestrator.rs): instruments the pipeline to capture per-PR timing, filtering decisions, and difficulty classification during benchmark runs
- Benchmark artifacts (benchmark-output/, benchmark_results.json, benchmark_output.json): sample benchmark results across 8 repositories covering Go, Java, Python, and Rust projects

Notes

- `benchmark` subcommand
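The headline metrics the benchmark reports (funnel yield, difficulty split) reduce to simple ratio arithmetic over the tracked counters. As an illustration, here is a hypothetical, minimal `BenchmarkMetrics` aggregator reproducing the numbers quoted in this PR; the real struct in src/swe/pipeline.rs tracks far more per-PR detail, and the counts of 9 medium / 2 easy are an inference from the quoted 81.8%/18.2% split and 72.7% pass rate (implying 11 classified PRs), not figures stated directly in the source.

```rust
// Hypothetical aggregator mirroring the funnel numbers quoted above.
// The real BenchmarkMetrics also records per-PR timing and filtering
// decisions; this sketch covers only the ratio arithmetic.

#[derive(Debug)]
struct BenchmarkMetrics {
    raw_events: u64,     // GH Archive events ingested
    accepted_tasks: u64, // tasks that passed quality scoring
    medium: u64,
    easy: u64,
    hard: u64,
}

impl BenchmarkMetrics {
    /// End-to-end funnel yield, as a percentage of raw events.
    fn yield_pct(&self) -> f64 {
        100.0 * self.accepted_tasks as f64 / self.raw_events as f64
    }

    /// Share of classified PRs rated medium, as a percentage.
    fn medium_pct(&self) -> f64 {
        let total = (self.medium + self.easy + self.hard) as f64;
        100.0 * self.medium as f64 / total
    }
}

fn main() {
    // Figures from the benchmark run documented in the README (2026-02-17).
    let m = BenchmarkMetrics {
        raw_events: 1_750_000,
        accepted_tasks: 8,
        medium: 9, // inferred: 81.8% of 11 classified PRs
        easy: 2,   // inferred: 18.2% of 11 classified PRs
        hard: 0,
    };
    println!("yield = {:.5}%", m.yield_pct());   // prints: yield = 0.00046%
    println!("medium = {:.1}%", m.medium_pct()); // prints: medium = 81.8%
}
```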